Let’s load some data from Experiment 2 of “Truncating the Y-Axis: Threat or Menace?” The experiment tests whether visualization designs that signal axis truncation can counteract distortions in the perceived differences between bars in a bar chart. Participants judge the magnitude of differences (i.e., effect size) on a 5-point Likert-style scale.
The data is at https://osf.io/5pjgb/, and the paper is at https://arxiv.org/pdf/1907.02035.pdf.
data <- read_csv("cleanrows2.csv")
## Parsed with column specification:
## cols(
## visType = col_character(),
## framing = col_character(),
## dataSize = col_double(),
## truncationLevel = col_double(),
## trendDirection = col_double(),
## slope = col_double(),
## id = col_character(),
## index = col_double(),
## training = col_double(),
## rt = col_double(),
## timestamp = col_character(),
## qTrend = col_double(),
## qSeverity = col_double(),
## correct = col_double(),
## firstX = col_double(),
## lastX = col_double(),
## x0 = col_double(),
## x1 = col_double(),
## x2 = col_double()
## )
Let’s preprocess our data for the purpose of modeling. This mostly involves renaming some variables and/or casting some of them as factors. We also want to drop training trials from our data as Correll et al. did in the paper.
data <- data %>%
mutate(
vis = as.factor(visType),
# truncation = as.factor(truncationLevel),
# slope = as.factor(slope),
complexity = as.factor(dataSize),
subject = as.factor(id)) %>%
rename(
response = qSeverity,
truncation = truncationLevel
) %>%
filter(is.na(training))
We’re going to fit an ordinal regression model using the brms package. You can read about this kind of model and brms at https://osf.io/gyfj7/download. Basically, the model assumes that the ordinal response on a K-point scale is generated by partitioning a latent (unobserved) continuous variable using K - 1 thresholds. The latent scale represents the participant’s mental representation of perceived effect size in arbitrary units. The thresholds separating adjacent response categories are intercepts in our model, and their location on the latent scale is modified by each fixed and random effect parameter in our model.
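To make the threshold idea concrete, here’s a minimal sketch of how a cumulative probit model turns a latent mean into probabilities over the five response categories. The threshold and predictor values below are made up for illustration; they are not estimates from this model.

```r
# Hypothetical values for illustration only (not fitted estimates).
tau  <- c(-1.5, -0.5, 0.5, 1.5)  # K - 1 = 4 thresholds on the latent scale
eta  <- 0.8                      # latent mean implied by the predictors
disc <- 1                        # discrimination (inverse latent SD)

# P(response = k) is the latent probability mass between adjacent thresholds:
# Phi(disc * (tau_k - eta)) - Phi(disc * (tau_{k-1} - eta))
p <- diff(pnorm(disc * (c(-Inf, tau, Inf) - eta)))
stopifnot(abs(sum(p) - 1) < 1e-9)  # the five category probabilities sum to 1
```

Shifting eta upward (e.g., a positive effect of some condition) moves probability mass toward higher response categories; increasing disc concentrates the mass in fewer categories.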
The model specification we’ve chosen will enable us to make inferences about the impact of different visualization designs and different levels of axis truncation, controlling for the actual magnitude of the difference participants are judging, the number of bars shown, and individual variability in responses. Note that in addition to modeling the threshold locations on the latent scale, we are also modeling the variability in the latent scale (i.e., ability to discriminate the underlying signal), allowing it to vary per participant.
We’ll fit the model using the default priors in brms.
m1 <- brm(
formula = bf(
response ~ vis + truncation*slope + complexity + (1|subject),
disc ~ (1|subject)),
family = cumulative("probit"),
chains = 2,
cores = 2,
iter = 2500,
warmup = 1000,
data = data,
control = list(adapt_delta = 0.99),
file = "ordinal-model1")
Let’s check some diagnostics.
In this summary table, we want to see Rhat values very close to 1.0 and Bulk_ESS in the thousands for proper inference. Rhat is a measure of chain mixing. We want to know that our chains are not getting stuck in local modes of the log joint posterior density (i.e., the quantity the sampler explores). When chains get stuck at local modes, it is usually a sign that our parameter space is “lumpy” in its geometry, and we should probably reparameterize our model.
Bulk_ESS is effective sample size. This tells us how many functionally independent samples from the posterior we actually have. For joint posteriors with strong correlation between parameters, this can be quite a lot lower than the number of MCMC iterations after warmup. We can bump Bulk_ESS higher by running our model fitting process for more iterations.
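If you’d rather check these diagnostics programmatically than scan the summary table by eye, brms exports `rhat()` and `neff_ratio()` methods for fitted models. A sketch, assuming the `m1` fit from above:

```r
# Pull split-Rhat values for every parameter; anything well above 1.0
# suggests the chains disagree about where the posterior mass is.
rhats <- rhat(m1)
names(rhats)[rhats > 1.05]   # parameters with questionable mixing (ideally none)

# Ratio of effective sample size to total post-warmup draws; small ratios
# mean highly autocorrelated chains and fewer independent samples.
ratios <- neff_ratio(m1)
names(ratios)[ratios < 0.1]  # parameters with few effectively independent draws
```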
summary(m1)
## Family: cumulative
## Links: mu = probit; disc = log
## Formula: response ~ vis + truncation * slope + complexity + (1 | subject)
## disc ~ (1 | subject)
## Data: data (Number of observations: 1152)
## Samples: 2 chains, each with iter = 2500; warmup = 1000; thin = 1;
## total post-warmup samples = 3000
##
## Group-Level Effects:
## ~subject (Number of levels: 32)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 10.08 4.10 4.30 19.63 1.01 345 454
## sd(disc_Intercept) 0.28 0.05 0.18 0.39 1.00 885 1416
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept[1] 24.02 9.54 10.39 46.86 1.01 318 426
## Intercept[2] 44.54 17.33 19.97 86.92 1.01 313 428
## Intercept[3] 63.12 24.52 28.59 123.69 1.01 313 406
## Intercept[4] 81.20 31.58 36.51 160.67 1.01 313 401
## disc_Intercept -2.40 0.38 -3.15 -1.66 1.01 314 437
## visbottomgradbar -0.77 1.08 -3.20 1.08 1.00 1636 1049
## visbrokenbar 0.22 1.04 -1.89 2.28 1.00 1717 1211
## truncation 35.91 15.66 14.37 73.57 1.01 367 453
## slope 169.82 67.01 75.76 329.80 1.01 323 398
## complexity3 -3.13 1.52 -6.98 -1.04 1.00 467 514
## truncation:slope 18.38 35.98 -45.74 99.04 1.00 1645 1227
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Trace plots help us look at chain mixing in greater detail. This can be helpful for diagnosing exactly which parameters the sampler is getting stuck on.
plot(m1)
Pairs plots show us correlation between pairs of parameters. Perfectly correlated parameters are usually bad news: it means our model cannot differentiate the effects of two redundant parameters, often because of issues of identifiability. Strong correlation between parameters is called multicollinearity, and it usually means we need to reparameterize our model.
I usually separate these out into coherent sets of parameters since these plots don’t scale well.
pairs(m1, pars = c("b_Intercept[1]","b_Intercept[2]","b_Intercept[3]", "b_Intercept[4]", "b_disc_Intercept"), exact_match = TRUE)
Some of these intercepts look pretty strongly correlated, but it doesn’t seem to have caused any other issues with the model fit.
pairs(m1, pars = c("sd_subject__Intercept", "sd_subject__disc_Intercept"), exact_match = TRUE)
pairs(m1, pars = c("b_truncation", "b_slope"))
pairs(m1, pars = c("b_vis", "b_complexity"))
Let’s compare the empirical distribution of responses to the distribution predicted by the model.
Here’s the empirical distribution.
data %>%
mutate(rating = ordered(response, levels = c("5","4","3","2","1"))) %>%
ggplot(aes(x = vis, fill = rating)) +
geom_bar() +
theme_minimal() +
facet_grid(. ~ truncation)
Here’s the predicted distribution. Note that the model produces many draws (i.e., samples) constituting a predictive distribution for each observation in the data set. In order to make the counts match, we take the model’s median response on each trial.
data %>%
add_predicted_draws(m1, seed = 14) %>%
group_by(vis, truncation, index, subject) %>%
summarize(rating = ordered(as.factor(median(as.numeric(.prediction))), levels =c("5","4","3","2","1"))) %>%
ggplot(aes(x = vis, fill = rating)) +
geom_bar() +
theme_minimal() +
facet_grid(. ~ truncation)
Notice how the model’s median prediction for each trial errs on the side of underestimating responses. This is just an artifact of taking the median of the distribution of predictions for each trial.
Now, instead of taking a median, let’s plot the predictive uncertainty of the model as animated hypothetical outcome plots (HOPs).
spec <- data %>%
add_predicted_draws(m1, seed = 14, n = 50) %>%
mutate(rating = ordered(.prediction, levels = c("5","4","3","2","1"))) %>%
ggplot(aes(x = vis, fill = rating)) +
geom_bar() +
theme_minimal() +
facet_grid(. ~ truncation) +
transition_manual(.draw)
animate(spec, fps = 2.5)
## nframes and fps adjusted to match transition
This shows us in detail how uncertain the model is about the distribution of responses in each condition. However, this overview might not reveal potential issues like skewed distributions of predictions. To do a more granular posterior predictive check, let’s compare observed and predicted ratings for individual subjects.
spec <- data %>%
add_predicted_draws(m1, seed = 14, n = 50) %>%
mutate(
observed = ordered(response, levels = c("1","2","3","4","5")),
predicted = ordered(.prediction, levels = c("1","2","3","4","5"))) %>%
pivot_longer(cols = c("observed", "predicted"), names_to = "source", values_to = "rating") %>%
ggplot(aes(x = rating, fill = source)) +
geom_bar(position = position_dodge()) +
theme_minimal() +
facet_wrap(. ~ subject) +
transition_manual(.draw)
animate(spec, fps = 2.5)
## nframes and fps adjusted to match transition
In this chart, we can see that our model predictions skew toward extreme ratings compared to a subset of participants who do not utilize the whole response scale (i.e., all possible responses 1-5). This may or may not be an issue depending on whether we want to use the model to predict behavior at the individual or group level.
Neither of these posterior predictive checks gives us much sense of inferential uncertainty. One common way that researchers tend to make inferences with Likert-style responses is to model differences in average response. We can do the same thing with the output of the ordinal regression model, and we can do so without assuming that these discrete responses are the outcome of an untransformed linear model (e.g., ANOVA).
Let’s find the average response in each condition, conditioning on the average user (i.e., no longer including random effects in our predictions). Note that now we are taking the average rating in each condition for each draw, giving us a distribution of inferential uncertainty about average responses (i.e., a sampling distribution of average response) rather than a wider distribution of predictive uncertainty about non-aggregated responses.
data %>%
add_predicted_draws(m1, seed = 14, re_formula = NA) %>%
group_by(vis, truncation, .draw) %>%
summarize(rating = weighted.mean(as.numeric(.prediction))) %>%
ggplot(aes(x = vis, y = rating)) +
stat_eye() +
theme_minimal() +
coord_cartesian(ylim = c(1.0, 3.5)) +
facet_grid(. ~ truncation)
Let’s look at contrasts for the effect of visualization design.
data %>%
add_predicted_draws(m1, seed = 14, re_formula = NA) %>%
group_by(vis, .draw) %>%
summarize(rating = weighted.mean(as.numeric(.prediction))) %>%
compare_levels(rating, by = vis) %>%
rename(diff_in_rating = rating) %>%
ggplot(aes(x = diff_in_rating, y = vis)) +
stat_halfeyeh() +
geom_vline(xintercept = 0, linetype = "longdash") +
theme_minimal()
One nice thing about Bayesian analysis is that we can say that we do not find a reliable difference between visualization conditions without all of that “failing to reject the null” business.
And let’s also look at contrasts for the effect of axis truncation.
data %>%
add_predicted_draws(m1, seed = 14, re_formula = NA) %>%
group_by(truncation, .draw) %>%
summarize(rating = weighted.mean(as.numeric(.prediction))) %>%
compare_levels(rating, by = truncation) %>%
rename(diff_in_rating = rating) %>%
ggplot(aes(x = diff_in_rating, y = truncation)) +
stat_halfeyeh() +
geom_vline(xintercept = 0, linetype = "longdash") +
theme_minimal()
These differences are really big, but we could also see this in a few of the previous charts. Axis limits have a pretty robust influence on perceived effect size.
Let’s take a model that doesn’t fit and try to diagnose what’s wrong. This model uses a more complex interaction structure than the last one, assuming that complexity (number of bars) interacts with visualization design, truncation, and slope (effect size).
m2 <- brm(
formula = bf(
response ~ vis*truncation*slope*complexity + (1|subject),
disc ~ (1|subject)),
family = cumulative("probit"),
chains = 2,
cores = 2,
iter = 2500,
warmup = 1000,
data = data,
control = list(adapt_delta = 0.99),
file = "ordinal-model2")
We can see from the summary that the chains failed to converge (Rhat > 1.1).
summary(m2)
## Warning: The model has not converged (some Rhats are > 1.1). Do not analyse the results!
## We recommend running more iterations and/or setting stronger priors.
## Family: cumulative
## Links: mu = probit; disc = log
## Formula: response ~ vis * truncation * slope * complexity + (1 | subject)
## disc ~ (1 | subject)
## Data: data (Number of observations: 1152)
## Samples: 2 chains, each with iter = 2500; warmup = 1000; thin = 1;
## total post-warmup samples = 3000
##
## Group-Level Effects:
## ~subject (Number of levels: 32)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 10436.07 4848.06 2178.07 19675.23 1.32 5 18
## sd(disc_Intercept) 0.27 0.05 0.17 0.38 1.04 61 160
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI
## Intercept[1] 17729.00 8980.76 3429.33
## Intercept[2] 37478.61 16959.18 8557.42
## Intercept[3] 56016.83 24851.76 12923.72
## Intercept[4] 74057.16 32516.51 16917.31
## disc_Intercept -9.23 0.55 -9.97
## visbottomgradbar -1955.18 5823.15 -12526.13
## visbrokenbar 2484.27 6806.93 -7457.49
## truncation 28691.35 17235.30 1474.18
## slope 151410.02 68959.10 34821.93
## complexity3 -1516.24 3384.42 -7848.49
## visbottomgradbar:truncation -4528.49 16117.07 -36943.20
## visbrokenbar:truncation -5783.75 20466.11 -66440.86
## visbottomgradbar:slope 17010.84 29916.62 -31585.23
## visbrokenbar:slope -3802.45 31598.95 -62575.73
## truncation:slope 83948.34 63021.18 -9934.63
## visbottomgradbar:complexity3 -2997.09 7009.19 -15167.79
## visbrokenbar:complexity3 -10619.04 9193.59 -28274.06
## truncation:complexity3 -9791.59 11637.24 -31852.37
## slope:complexity3 -3752.87 15518.61 -31680.93
## visbottomgradbar:truncation:slope -38760.26 77256.09 -180295.92
## visbrokenbar:truncation:slope -2523.47 93499.58 -143779.07
## visbottomgradbar:truncation:complexity3 36772.22 23776.86 -5774.30
## visbrokenbar:truncation:complexity3 39968.33 32294.41 -7122.51
## visbottomgradbar:slope:complexity3 12987.26 35836.76 -59694.29
## visbrokenbar:slope:complexity3 46447.63 44228.60 -33907.37
## truncation:slope:complexity3 20059.28 58561.50 -88471.74
## visbottomgradbar:truncation:slope:complexity3 -140961.72 104355.29 -311619.14
## visbrokenbar:truncation:slope:complexity3 -171145.73 155491.21 -470387.86
## u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept[1] 35475.81 1.28 6 28
## Intercept[2] 68955.98 1.34 5 17
## Intercept[3] 102658.57 1.36 5 15
## Intercept[4] 133472.53 1.37 5 15
## disc_Intercept -7.80 1.42 4 15
## visbottomgradbar 8622.71 2.25 3 28
## visbrokenbar 17329.30 2.04 3 11
## truncation 58849.34 1.30 5 19
## slope 282245.40 1.32 5 19
## complexity3 4787.86 1.26 7 53
## visbottomgradbar:truncation 24055.48 1.69 3 17
## visbrokenbar:truncation 22460.41 2.09 3 11
## visbottomgradbar:slope 75425.39 1.97 3 24
## visbrokenbar:slope 50125.42 2.11 3 34
## truncation:slope 204104.03 1.78 3 46
## visbottomgradbar:complexity3 11192.09 1.56 4 16
## visbrokenbar:complexity3 4622.95 1.58 3 24
## truncation:complexity3 11245.90 1.42 4 42
## slope:complexity3 27543.11 1.33 5 80
## visbottomgradbar:truncation:slope 105196.23 1.70 3 25
## visbrokenbar:truncation:slope 229429.24 2.09 3 12
## visbottomgradbar:truncation:complexity3 75577.66 1.47 4 39
## visbrokenbar:truncation:complexity3 105046.33 1.78 3 11
## visbottomgradbar:slope:complexity3 71449.65 1.51 4 33
## visbrokenbar:slope:complexity3 129702.12 1.65 3 25
## truncation:slope:complexity3 144635.72 1.35 6 20
## visbottomgradbar:truncation:slope:complexity3 54997.40 1.51 4 25
## visbrokenbar:truncation:slope:complexity3 86490.74 1.77 3 13
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
Let’s take a look at these non-converging chains as trace plots.
plot(m2)
We can see that the chains are wandering around without converging. This makes me think that we could improve our model fit with some regularization (narrower priors).
Sometimes (not in this case) chains will get stuck sampling a single parameter value repeatedly. This usually signals issues with identifiability, suggesting that our model is mis-specified and should be reparameterized.
Let’s take a look at the default priors we’ve been using before we create our own.
get_prior(
formula = bf(
response ~ vis*truncation*slope*complexity + (1|subject),
disc ~ (1|subject)),
family = cumulative("probit"),
data = data)
## prior class coef
## 1 b
## 2 b complexity3
## 3 b slope
## 4 b slope:complexity3
## 5 b truncation
## 6 b truncation:complexity3
## 7 b truncation:slope
## 8 b truncation:slope:complexity3
## 9 b visbottomgradbar
## 10 b visbottomgradbar:complexity3
## 11 b visbottomgradbar:slope
## 12 b visbottomgradbar:slope:complexity3
## 13 b visbottomgradbar:truncation
## 14 b visbottomgradbar:truncation:complexity3
## 15 b visbottomgradbar:truncation:slope
## 16 b visbottomgradbar:truncation:slope:complexity3
## 17 b visbrokenbar
## 18 b visbrokenbar:complexity3
## 19 b visbrokenbar:slope
## 20 b visbrokenbar:slope:complexity3
## 21 b visbrokenbar:truncation
## 22 b visbrokenbar:truncation:complexity3
## 23 b visbrokenbar:truncation:slope
## 24 b visbrokenbar:truncation:slope:complexity3
## 25 student_t(3, 0, 10) Intercept
## 26 Intercept 1
## 27 Intercept 2
## 28 Intercept 3
## 29 Intercept 4
## 30 student_t(3, 0, 10) sd
## 31 sd
## 32 sd Intercept
## 33 normal(0, 1) Intercept
## 34 student_t(3, 0, 10) sd
## 35 sd
## 36 sd Intercept
## group resp dpar nlpar bound
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## 11
## 12
## 13
## 14
## 15
## 16
## 17
## 18
## 19
## 20
## 21
## 22
## 23
## 24
## 25
## 26
## 27
## 28
## 29
## 30
## 31 subject
## 32 subject
## 33 disc
## 34 disc
## 35 subject disc
## 36 subject disc
We’ve got a bunch of slope parameters because of all of the interaction effects in our model. Let’s look at the first model (the one that fit) for comparison.
get_prior(
formula = bf(
response ~ vis*truncation*slope + complexity + (1|subject),
disc ~ (1|subject)),
family = cumulative("probit"),
data = data)
## prior class coef group resp
## 1 b
## 2 b complexity3
## 3 b slope
## 4 b truncation
## 5 b truncation:slope
## 6 b visbottomgradbar
## 7 b visbottomgradbar:slope
## 8 b visbottomgradbar:truncation
## 9 b visbottomgradbar:truncation:slope
## 10 b visbrokenbar
## 11 b visbrokenbar:slope
## 12 b visbrokenbar:truncation
## 13 b visbrokenbar:truncation:slope
## 14 student_t(3, 0, 10) Intercept
## 15 Intercept 1
## 16 Intercept 2
## 17 Intercept 3
## 18 Intercept 4
## 19 student_t(3, 0, 10) sd
## 20 sd subject
## 21 sd Intercept subject
## 22 normal(0, 1) Intercept
## 23 student_t(3, 0, 10) sd
## 24 sd subject
## 25 sd Intercept subject
## dpar nlpar bound
## 1
## 2
## 3
## 4
## 5
## 6
## 7
## 8
## 9
## 10
## 11
## 12
## 13
## 14
## 15
## 16
## 17
## 18
## 19
## 20
## 21
## 22 disc
## 23 disc
## 24 disc
## 25 disc
That’s a difference of 11 parameters from changing a + to a * in our model specification! This is one reason why it’s smart to build models up incrementally in order to understand more complex models in terms of simpler ones, a process called model expansion.
Let’s try placing some narrower priors on these slope parameters. This should make it easier for the sampler to home in on a specific region of parameter space.
m3 <- brm(
formula = bf(
response ~ vis*truncation*slope*complexity + (1|subject),
disc ~ (1|subject)),
family = cumulative("probit"),
prior = prior(normal(0, 0.5), class = b),
chains = 2,
cores = 2,
iter = 2500,
warmup = 1000,
data = data,
control = list(adapt_delta = 0.99),
file = "ordinal-model3")
Let’s check our model fit.
summary(m3)
## Family: cumulative
## Links: mu = probit; disc = log
## Formula: response ~ vis * truncation * slope * complexity + (1 | subject)
## disc ~ (1 | subject)
## Data: data (Number of observations: 1152)
## Samples: 2 chains, each with iter = 2500; warmup = 1000; thin = 1;
## total post-warmup samples = 3000
##
## Group-Level Effects:
## ~subject (Number of levels: 32)
## Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept) 0.15 0.03 0.10 0.22 1.00 541 940
## sd(disc_Intercept) 0.27 0.05 0.18 0.38 1.00 1058 1956
##
## Population-Level Effects:
## Estimate Est.Error l-95% CI
## Intercept[1] 0.31 0.06 0.20
## Intercept[2] 0.61 0.10 0.43
## Intercept[3] 0.89 0.14 0.63
## Intercept[4] 1.16 0.18 0.83
## disc_Intercept 1.74 0.17 1.42
## visbottomgradbar -0.03 0.06 -0.15
## visbrokenbar -0.02 0.06 -0.14
## truncation 0.54 0.12 0.31
## slope 2.17 0.34 1.53
## complexity3 -0.07 0.06 -0.19
## visbottomgradbar:truncation -0.11 0.12 -0.36
## visbrokenbar:truncation -0.08 0.12 -0.32
## visbottomgradbar:slope 0.25 0.26 -0.25
## visbrokenbar:slope 0.22 0.25 -0.25
## truncation:slope 0.60 0.37 -0.10
## visbottomgradbar:complexity3 0.02 0.08 -0.12
## visbrokenbar:complexity3 -0.03 0.07 -0.18
## truncation:complexity3 -0.03 0.12 -0.27
## slope:complexity3 0.15 0.24 -0.30
## visbottomgradbar:truncation:slope -0.22 0.41 -1.04
## visbrokenbar:truncation:slope -0.08 0.41 -0.88
## visbottomgradbar:truncation:complexity3 0.15 0.16 -0.16
## visbrokenbar:truncation:complexity3 0.14 0.15 -0.15
## visbottomgradbar:slope:complexity3 -0.09 0.31 -0.72
## visbrokenbar:slope:complexity3 0.05 0.30 -0.54
## truncation:slope:complexity3 -0.15 0.41 -0.90
## visbottomgradbar:truncation:slope:complexity3 -0.24 0.46 -1.14
## visbrokenbar:truncation:slope:complexity3 -0.23 0.45 -1.11
## u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept[1] 0.45 1.00 650 1437
## Intercept[2] 0.82 1.00 561 1219
## Intercept[3] 1.17 1.00 526 1060
## Intercept[4] 1.54 1.00 514 983
## disc_Intercept 2.08 1.00 529 890
## visbottomgradbar 0.08 1.00 993 1513
## visbrokenbar 0.09 1.00 1195 1829
## truncation 0.80 1.00 794 1399
## slope 2.86 1.00 649 1346
## complexity3 0.04 1.01 952 1290
## visbottomgradbar:truncation 0.12 1.00 1238 1839
## visbrokenbar:truncation 0.15 1.00 1411 1631
## visbottomgradbar:slope 0.79 1.00 1005 1611
## visbrokenbar:slope 0.73 1.00 1231 1874
## truncation:slope 1.35 1.00 2136 1907
## visbottomgradbar:complexity3 0.17 1.00 1094 1555
## visbrokenbar:complexity3 0.11 1.00 1281 2004
## truncation:complexity3 0.20 1.00 1300 1687
## slope:complexity3 0.64 1.01 1262 1675
## visbottomgradbar:truncation:slope 0.57 1.00 2029 2131
## visbrokenbar:truncation:slope 0.72 1.00 2001 1770
## visbottomgradbar:truncation:complexity3 0.46 1.00 1386 1867
## visbrokenbar:truncation:complexity3 0.44 1.00 1502 1919
## visbottomgradbar:slope:complexity3 0.51 1.00 1177 1808
## visbrokenbar:slope:complexity3 0.66 1.00 1601 1772
## truncation:slope:complexity3 0.67 1.00 1932 2285
## visbottomgradbar:truncation:slope:complexity3 0.66 1.00 2316 1887
## visbrokenbar:truncation:slope:complexity3 0.62 1.00 2459 2066
##
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample
## is a crude measure of effective sample size, and Rhat is the potential
## scale reduction factor on split chains (at convergence, Rhat = 1).
plot(m3)
This looks much better! Why did changing the prior work?
Imagine the sampler as a marble dropped into a bowl, sweeping out a path along the bowl’s surface. In this common metaphor, the marble is the sampler, the surface of the bowl is the parameter space, and discrete locations on the marble’s path are our posterior draws. When we add many parameters to a model, as we did when we added multiple interactions, we create an n-dimensional surface that can become “lumpy”, with local extrema making it difficult for the chains to converge. By setting regularizing priors we smooth the surface of the parameter space, making it easier for chains to converge.
Another way of thinking about this is that adding parameters to a model entails an assumption that there are more sources of variation. As we add more sources of variation, we need to assume that each individual parameter is going to contribute less variance to the posterior distribution. Otherwise, we tell the sampler to explore an overdispersed parameter space, leading to the wandering chains we saw before.
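To see how much regularization the new prior adds, recall that the `get_prior()` output above showed a flat (improper) default prior on the `b` class, so `normal(0, 0.5)` is a substantial change. A quick sketch of the mass it concentrates near zero on the latent probit scale:

```r
# Prior mass a normal(0, 0.5) prior places within +/- 1 latent SD of zero.
# A flat prior spreads its mass over the entire real line instead.
mass_within_1 <- pnorm(1, 0, 0.5) - pnorm(-1, 0, 0.5)
stopifnot(mass_within_1 > 0.95)  # about 95% of prior mass within +/- 1
```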
Let’s compare the empirical distribution of responses to the distribution predicted by the model.
Here’s the empirical distribution.
data %>%
mutate(rating = ordered(response, levels = c("5","4","3","2","1"))) %>%
ggplot(aes(x = vis, fill = rating)) +
geom_bar() +
theme_minimal() +
facet_grid(. ~ truncation)
Here’s the posterior predictive distribution as HOPs.
spec <- data %>%
add_predicted_draws(m3, seed = 14, n = 50) %>%
mutate(rating = ordered(.prediction, levels = c("5","4","3","2","1"))) %>%
ggplot(aes(x = vis, fill = rating)) +
geom_bar() +
theme_minimal() +
facet_grid(. ~ truncation) +
transition_manual(.draw)
animate(spec, fps = 2.5)
## nframes and fps adjusted to match transition
This looks pretty similar to the simpler model we fit previously. Let’s compare the two models that fit well using leave-one-out (loo) cross-validation, which estimates out-of-sample predictive performance as expected log predictive density (elpd). As a rule of thumb, larger (less negative) values of elpd indicate better expected fit to new data.
loo(m1, m3)
## Warning: Found 1 observations with a pareto_k > 0.7 in model 'm1'. It is
## recommended to set 'reloo = TRUE' in order to calculate the ELPD without the
## assumption that these observations are negligible. This will refit the model 1
## times to compute the ELPDs for the problematic observations directly.
## Warning: Found 1 observations with a pareto_k > 0.7 in model 'm3'. It is
## recommended to set 'reloo = TRUE' in order to calculate the ELPD without the
## assumption that these observations are negligible. This will refit the model 1
## times to compute the ELPDs for the problematic observations directly.
## Output of model 'm1':
##
## Computed from 3000 by 1152 log-likelihood matrix
##
## Estimate SE
## elpd_loo -1105.9 23.6
## p_loo 58.7 3.2
## looic 2211.9 47.1
## ------
## Monte Carlo SE of elpd_loo is NA.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 1144 99.3% 970
## (0.5, 0.7] (ok) 7 0.6% 254
## (0.7, 1] (bad) 1 0.1% 441
## (1, Inf) (very bad) 0 0.0% <NA>
## See help('pareto-k-diagnostic') for details.
##
## Output of model 'm3':
##
## Computed from 3000 by 1152 log-likelihood matrix
##
## Estimate SE
## elpd_loo -1115.4 23.3
## p_loo 68.7 3.5
## looic 2230.7 46.6
## ------
## Monte Carlo SE of elpd_loo is NA.
##
## Pareto k diagnostic values:
## Count Pct. Min. n_eff
## (-Inf, 0.5] (good) 1144 99.3% 670
## (0.5, 0.7] (ok) 7 0.6% 270
## (0.7, 1] (bad) 1 0.1% 181
## (1, Inf) (very bad) 0 0.0% <NA>
## See help('pareto-k-diagnostic') for details.
##
## Model comparisons:
## elpd_diff se_diff
## m1 0.0 0.0
## m3 -9.4 2.5
It looks like the simpler model (m1) actually fits slightly better out-of-sample (m3 has an elpd_diff of -9.4), so the extra interaction terms don’t improve predictive performance once we account for overfitting. It is common practice to run a handful of models, expanding from simpler to more complex specifications, and compare them all using posterior predictive visualizations and procedures like loo cross-validation.